
Fixes for General Additive Models #743


Merged: 25 commits merged into dotnet:master on Sep 17, 2018

Conversation

@rogancarr (Contributor) commented Aug 27, 2018

This PR addresses a number of issues with the General Additive Model (GAM) trainer. In particular, it fixes the GAM Classifier, which neither fit properly nor produced a probability, and adds support for validation pruning, summary text, and centering of the feature effects around a mean response (i.e. an intercept).

Additionally, the PR addresses some minor issues in the codebase, like GAMs using copy-and-pasted versions of FastTree code, unnecessarily public attributes, and unused arguments, and it splits the BinaryClassifier and Regressor into separate files.
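For context on the centering change below: a GAM's prediction is just an intercept plus one learned lookup per feature. A minimal sketch of that scoring shape, using hypothetical names rather than the actual ML.NET types:

    using System;

    // Hypothetical sketch of GAM scoring with a centered mean response;
    // not the actual ML.NET implementation.
    internal static class GamScoringSketch
    {
        public static double Score(double[] features, double intercept,
            double[][] binEffects, Func<int, double, int> getBin)
        {
            double score = intercept;              // mean response learned in training
            for (int i = 0; i < features.Length; i++)
            {
                int bin = getBin(i, features[i]);  // map raw value to its histogram bin
                score += binEffects[i][bin];       // add the centered per-feature effect
            }
            return score;                          // for classification, this is the logit
        }
    }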

The changes are as follows:

  1. Centered the response and added an intercept term; fixed corresponding issues with table lookups, sparse calculations, and exports. (General Additive Model features are not centered after learning. #739)
  2. Added in a validation set and pruning based on the validation set, producing a final pruned graph at the end. (General Additive Models do not allow pruning with a validation set #740)
    a. Switched to using ScoreTrackers to keep track of scores during boosting.
    b. Save boost iterations as individual graphs (n_features x n_iterations x n_boosts) for pruning.
  3. Fixed GAM Classifier to use a small learning rate (FastTree Gradient of Logistic Loss prohibits small learning rates #741)
    Updated the FastTreeBinaryClassification logistic loss gradient to take the sigmoid parameter as input: the GAM Binary Classifier now uses unity, while the FastTree Binary Classifier keeps the same default of 2*learning rate (no change). Optionally, we can plumb this through to the FastTree arguments if we show an experimental gain; we can also experiment to see whether the sigmoid parameter gives gains for GAMs (i.e. by slowing learning even more). A sketch of this gradient appears after this list.
  4. Added calibration to the GamClassifier to produce probabilities (General Additive Models (GAM) for Classification learn the logit, not class probability #738).
    This meant changing the various interfaces.
  5. Refactored GAMRegressor and GAMClassifier into their own files.
    The one file for all GAM trainers had gotten a bit long, and this change is consistent with ML.NET tradition.
  6. Added tests to verify train loss and validation metrics (e.g. same value is produced by boosting and the lookup table).
  7. Added a Summary to the GAMPredictor to produce training statistics and the feature table (General Additive Model (GAM) has no summary to extract features at runtime #742).
  8. Rewrote the core split-finding routine to share FastTree's code, so that it is covered by the FastTree unit tests.
    The GAM routines used copy-and-pasted versions of internal FastTree components. To fight entropy, these were refactored to use the same central calculation, so that it is verified and validated by the FastTree unit and end-to-end tests.
  9. Removed unused arguments.
  10. Allowed for weighted samples.
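Regarding item 3, here is a hedged sketch of the sigmoid-parameterized logistic-loss gradient in its standard boosted-trees (Friedman) form, with labels in {-1, +1}; the exact FastTree expression may differ in details:

    using System;

    internal static class LogisticLossSketch
    {
        // Pseudo-response for the loss log(1 + exp(-2 * sigmoid * label * score)).
        // Per the PR: GAM passes sigmoid = 1 (unity); FastTree keeps its existing
        // default of 2 * learningRate, so FastTree behavior is unchanged.
        public static double Gradient(double label, double score, double sigmoid)
        {
            return 2 * sigmoid * label / (1 + Math.Exp(2 * sigmoid * label * score));
        }
    }

A smaller sigmoid flattens the gradient, which is why it acts as an additional brake on learning.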

Fixes #738
Fixes #739
Fixes #740
Fixes #741
Fixes #742

Rogan Carr added 20 commits August 19, 2018 15:52
…eeBinaryClassification Loss to take the sigmoid parameter as input; FastTree uses the default, stays the same. Gam uses Unity. Refactored GamRegressor and GamClassifier into their own files. Added tests to verify Train loss and validation metrics.
… 2 into the sigmoid parameter; this allows GAMs to output features on the scale of the logit.
…no graph would be defined. As the graph is still accessed, it must be defined to zero in such cases.
@dnfclas commented Aug 27, 2018

CLA assistant check
All CLA requirements met.

@rogancarr requested review from TomFinley and Zruty0 on August 27, 2018
@@ -8,6 +8,11 @@
     <AllowUnsafeBlocks>true</AllowUnsafeBlocks>
   </PropertyGroup>

+  <ItemGroup>
+    <None Remove="GamClassification.cs" />
Contributor, commenting on None Remove="GamClassification.cs":

What is this Remove?

@@ -27,6 +32,8 @@
     <Compile Include="FastTreeRegression.cs" />
     <Compile Include="FastTree.cs" />
     <Compile Include="FastTreeTweedie.cs" />
+    <Compile Include="GamClassification.cs" />
+    <Compile Include="GamRegression.cs" />
     <Compile Include="GamTrainer.cs" />
Contributor:

Actually this is somewhat peculiar. I know you didn't do this, but why is the explicit listing in this project necessary at all? Does anything break if you get rid of this entire ItemGroup with all the <Compile Include> entries?

@TomFinley (Contributor):

Thanks for doing this @rogancarr -- should I infer from the lack of baseline changes that no tests were migrated to test GAM? I think we probably ought to do some here -- we have some standard regression problems and whatnot.

{
    using (var env = new TlcEnvironment())
    {
        var trainFile = "binary_sine_logistic_10k.tsv";
Contributor, commenting on binary_sine_logistic_10k:

Just a reminder that you need to check this file into the repo, or use build actions to download it.
(If you want to check it in, you also need CELA approval.)

Contributor:

Hi @Ivanidzo4ka, I believe that this is a synthetic dataset, so CELA approval is probably not necessary, since it does not come from any real external source? Or am I wrong?

@rogancarr (Author):

This is a synthetic dataset. If we want to add this test to the repo, then I'll add the file.

@rogancarr (Author):

But the more important question is: do we need this test?

This is really a test of the internal state of the learner that I used for debugging. If we add baseline tests for GAMs, then it seems we don't need this test, and we may not want it, since it relies on internal state.

@TomFinley (Contributor), commenting on the value False at src/Microsoft.ML.FastTree/Microsoft.ML.FastTree.csproj:7 in d0d4e6b:

I think this is the culprit that is requiring the explicit listing.

@rogancarr (Author):

Yes, my best guess is that this line:
<EnableDefaultCompileItems>False</EnableDefaultCompileItems>
is why we are explicitly listing the files to compile.
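For readers unfamiliar with this MSBuild behavior: EnableDefaultCompileItems controls whether the SDK automatically globs **/*.cs into the build. A minimal illustration of the two styles under discussion (not the actual project file):

    <!-- With default compile items disabled, every source file must be listed: -->
    <PropertyGroup>
      <EnableDefaultCompileItems>False</EnableDefaultCompileItems>
    </PropertyGroup>
    <ItemGroup>
      <Compile Include="GamClassification.cs" />
      <Compile Include="GamRegression.cs" />
      <Compile Include="GamTrainer.cs" />
    </ItemGroup>

    <!-- Dropping the property lets the SDK include **/*.cs automatically,
         making the explicit ItemGroup unnecessary. -->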

@rogancarr (Author):

New commit addressing comments on the PR:

  • Removed the strange None Remove entries in the project file
  • Updated the CSharpApi to reflect the new arguments for GAM binary classification
  • Updated the entrypoints manifest to reflect the new arguments for GAM binary classification

Next:

  • Updating baseline tests for GAM Binary Classification to reflect the new learner
  • Adding baseline tests for the old GAM Regression; confirming same results for new GAM Regression

… removed TrainingInfo (and therefore the test to validate training loss); removed the post-training scoring update; made the statistics calculator interface internal to FastTree.
@rogancarr (Author) commented Aug 31, 2018

New commit addressing offline reviews:

  • Switched the sub-graph calculation from multidimensional arrays to structs (see the sketch after this list)
  • Removed TrainingInfo (and therefore the test to validate training loss): we need a general way to do this, not a one-off solution for one learner.
  • Removed the post-training scoring update: at this point it's not necessary, and it hurts performance.
  • Made the statistics calculator interface internal to FastTree: it cannot be an abstract class.
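A hedged sketch of that data-structure change, with hypothetical names rather than the real ML.NET types: each boosting step's contribution is held as its own small struct instead of a slice of one dense [feature x iteration x bin] array, which makes pruning to the best iteration a matter of ignoring later entries.

    // Hypothetical illustration; the actual ML.NET types differ.
    internal readonly struct SubGraph
    {
        public readonly int FeatureIndex;     // feature this shape-function piece belongs to
        public readonly int Iteration;        // boosting iteration that produced it
        public readonly double[] BinEffects;  // additive effect per histogram bin

        public SubGraph(int featureIndex, int iteration, double[] binEffects)
        {
            FeatureIndex = featureIndex;
            Iteration = iteration;
            BinEffects = binEffects;
        }
    }

This layout also explains the temporary slowdown reported later in the thread: many small pieces are kept alive for pruning instead of one master graph.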

Next:

  • Run performance comparisons against the previous version for GAM{Regressor, BinaryClassifier} and FastTree{Regressor, BinaryClassifier} (in progress)
  • Recompute baselines (note that the tests will fail until this is done)

@@ -248,7 +248,7 @@ public void CopyFeatureHistogram(int subfeatureIndex, ref PerBinStats[] hist)
     double sumGTTargets = 0.0;
     double sumGTWeights = eps;
     int gtCount = 0;
-    sumWeights = 2 * eps;
+    sumWeights += 2 * eps;
     double gainShift = leafCalculator.GetLeafSplitGain(totalCount, sumTargets, sumWeights);
Contributor:

Nice
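A reading of this one-character fix (my interpretation, not text from the PR): sumWeights has already accumulated the example weights at this point, so assigning 2 * eps overwrote them and effectively discarded sample weights during split finding; adding the epsilon keeps the accumulated sum while still guarding the gain computation. In sketch form:

    // Hypothetical sketch of the surrounding logic, not the actual FastTree code.
    internal static class SplitGuardSketch
    {
        public static double GuardedSumWeights(double[] exampleWeights, double eps)
        {
            double sumWeights = 0.0;
            foreach (var w in exampleWeights)
                sumWeights += w;       // accumulate (possibly non-unit) sample weights

            // Guard against degenerate splits without discarding the accumulated sum.
            sumWeights += 2 * eps;     // the old code did: sumWeights = 2 * eps;
            return sumWeights;
        }
    }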

@rogancarr (Author):

New commits:

  • Fixed a bug introduced into the split-finding algorithm for weighted calculations
  • Updated the baseline tests for GamBinaryClassifier

I performed two performance tests and sent the results to reviewers offline:

  1. Determine if the new implementation of GAMs had an effect on goodness-of-fit (AUC, RMS) and training time.
  2. Determine if the changes to FastTree had an effect on goodness-of-fit (AUC, RMS) and training time.

Test Results:

  1. The GAM changes resulted in better goodness-of-fit for Binary Classification and equivalent goodness-of-fit for Regression. Regression was not numerically identical because the order of additions to the final graphs changed. The changes also introduced a slowdown of roughly 1.3x due to the way graphs are held for pruning (many small pieces instead of one master graph). This can be targeted for future optimization (I have a few ideas).
  2. The changes had no measurable effect on FastTree. Goodness-of-fit was numerically identical and training time was equivalent.

@rogancarr (Author) commented Sep 14, 2018

In the build, SDCA failed a test on Linux Debug -- this seems unrelated to the changes in this PR -- is it possible to rerun the tests?

Edit: I reran the tests.

{
[Argument(ArgumentType.AtMostOnce, HelpText = "Metric for pruning. (For regression, 1: L1, 2:L2; default L2)", ShortName = "pmetric")]
[TGUI(Description = "Metric for pruning. (For regression, 1: L1, 2:L2; default L2")]
public int PruningMetrics = 2;
Contributor, commenting on int PruningMetrics = 2:

Wow, really? Can we maybe have an enum here?

@rogancarr (Author):

This is a holdover from FastTree -- it's copied over from FastTreeRegression and is a parameter to the RegressionTest class.

If possible, I'd like to file this as a separate issue and fix as a separate PR.
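If that follow-up is taken, a sketch of what the enum-based argument might look like (hypothetical names; the eventual fix may differ):

    // Hypothetical replacement for the magic int; not an actual ML.NET change.
    public enum PruningMetric
    {
        L1 = 1,
        L2 = 2,
    }

    [Argument(ArgumentType.AtMostOnce, HelpText = "Metric for pruning.", ShortName = "pmetric")]
    public PruningMetric PruningMetrics = PruningMetric.L2;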

@Zruty0 (Contributor) left a comment:

:shipit:

@TomFinley (Contributor) left a comment:

Thank you @rogancarr, I'm requesting a rebuild due to an unrelated SDCA test failure.

@rogancarr (Author):

Thanks for the review, @TomFinley and @Zruty0! I addressed all but one issue and responded to that one in the comments here. Note that this PR is flagged WIP, but it is not WIP -- I had a WIP commit that I didn't remove because we will squash-merge this branch.

@TomFinley merged commit 345a5c2 into dotnet:master on Sep 17, 2018
@rogancarr deleted the roganc/gam_bug_fixes branch on April 5, 2019
@ghost locked this conversation as resolved and limited it to collaborators on Mar 29, 2022